if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
BiocManager::install("GEOmetadb")
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
if (!requireNamespace("edgeR", quietly = TRUE))
BiocManager::install("edgeR")
if (!requireNamespace("ggplot2", quietly = TRUE))
install.packages("ggplot2")
if (!requireNamespace("reshape2", quietly = TRUE))
BiocManager::install("reshape2")
if (!requireNamespace("HGNChelper", quietly = TRUE))
BiocManager::install("HGNChelper")
if (!requireNamespace("limma", quietly = TRUE))
BiocManager::install("limma")
if (!requireNamespace("knitr", quietly = TRUE))
BiocManager::install("knitr")
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
install.packages("ComplexHeatmap")
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
if (!requireNamespace("Biobase", quietly = TRUE))
BiocManager::install("Biobase")
Get the data for my chosen GEO id. I chose to work with the data set from an experiment that uses RNA-sequencing to study stem cell derived cortical excitatory (cExN) and cortical inhibitory (cIN) neurons from three individuals with varying affectation of autism syndrome disorder (ASD) (Lewis, Meganathan and Baldridge 2019).
myGEOID <- 'GSE129806'
suppFiles = GEOquery::getGEOSuppFiles(myGEOID)
This portion summarizes the data preprocessing and normalziation steps we did in assignment 1.
fileNames = rownames(suppFiles)
data <- read.delim(fileNames[2], check.names=FALSE, header=TRUE)
colnames(data)[1] <- 'gene_id'
na.omit(data)
# Remove dates from excel errors
data <- data[!grepl('^[0-9]{1,2}[-][a-zA-Z]{3}$', data$gene_id),]
rownames(data) <- data$gene_id
replacements <- HGNChelper::checkGeneSymbols(data$gene_id)
## Maps last updated on: Thu Oct 24 12:31:05 2019
## Warning in HGNChelper::checkGeneSymbols(data$gene_id): Human gene symbols should
## be all upper-case except for the 'orf' in open reading frames. The case of some
## letters was corrected.
## Warning in HGNChelper::checkGeneSymbols(data$gene_id): x contains non-approved
## gene symbols
# Update the gene id with the newest HGNC symbols if available
data$gene_id <- ifelse(!replacements$Approved & !is.na(replacements$Suggested.Symbol), replacements$Suggested.Symbol, replacements$x)
# Calculate CPM
cpms = edgeR::cpm(data[,3:34])
rownames(cpms) <- data[,1]
# Remove low counts
keep = rowSums(cpms > 1) >= 3
data_filtered = data[keep,]
rows_removed = dim(data)[1]-dim(data_filtered)[1]
# Based on code from lecture 4
samples <- data.frame(lapply(colnames(data)[3:34], FUN=function(x){unlist(strsplit(x, split="_"))[c(1,2)]}))
colnames(samples) = colnames(data)[3:34]
rownames(samples) = c("patients", "cell_type")
samples <- data.frame(t(samples))
# Based on code from lecture 4
data_filtered_matrix <- as.matrix(data_filtered[,3:34])
rownames(data_filtered_matrix) <- data_filtered$gene_id
d = edgeR::DGEList(counts=data_filtered_matrix, group=samples$cell_type)
d = edgeR::calcNormFactors(d)
normalized_counts <- edgeR::cpm(d)
The purpose of this section is to conduct differential expression analysis and rank genes with the normalized expression set from assignment 1.
In assignment 1, we downloaded the data set of interest. The data set is from an experiment that uses RNA-sequencing to study stem cell derived cortical excitatory (cExN) and cortical inhibitory (cIN) neurons from three individuals with varying affectation of autism syndrome disorder (ASD) (Lewis, Meganathan and Baldridge 2019). To explore the data, the null values, mistakes (dates instead of genes) and outliers were removed. The data was then normalized using TMM normalization. The pre-normalization and post-normalization data was graphed out using boxplots and density curves.
From the MDS plot we can see two things. One is the significant dissimilarity between the two cell types. In dark green we have the cortical inhibitory neuron cells and in the blue we have the cortical excitatory neuron cells. As well we can see the that both cell types of the affected proband (AP) are clustered near the top. Both cell types of the mother (UM) and sibling (IS) of the affected proband are clustered near the middle and the unaffected control cell types are clustered near the bottom. Since we’re interested in seeing which genes may be attributed to autism syndrome disorder, we’ll focus our analysis on which patient (UM, IS, AP or UC) the cells come from.
limma::plotMDS(d, labels=rownames(samples), col = c("darkgreen", "blue")[factor(samples$cell_type)])
# Take a look at the normalized data from the previous assignment
knitr::kable(normalized_counts[1:5, 1:5], type="html")
| UC_cIN_R1 | UC_cIN_R2 | UC_cIN_R3 | UC_cIN_R4 | AP_cIN_R1 | |
|---|---|---|---|---|---|
| A1BG | 0.7720147 | 0.389810 | 1.0782781 | 0.4301998 | 0.7493442 |
| A1BG-AS1 | 4.7864913 | 4.911606 | 4.8522515 | 5.1193774 | 5.3952779 |
| A2M | 40.3763699 | 57.263093 | 20.7029399 | 34.8461820 | 20.0324672 |
| A2M-AS1 | 2.0458390 | 3.352366 | 4.2052847 | 3.8287780 | 0.7493442 |
| A2ML1 | 0.1544029 | 0.194905 | 0.2695695 | 0.2581199 | 0.0999126 |
# Based off lecture notes
# Convert it to a number matrix
heatmap_matrix <- as.matrix(normalized_counts)
# Scale and centre each row around the mean
heatmap_matrix <- t(scale(t(heatmap_matrix)))
# Create the heat map
if(min(heatmap_matrix) == 0){
heatmap_col = circlize::colorRamp2(c(0, max(heatmap_matrix)),
c("white", "red"))
} else {
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix), 0,
max(heatmap_matrix)), c("blue", "white", "red"))
}
current_heatmap <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix),
show_row_dend = TRUE, show_column_dend = TRUE,
col=heatmap_col, show_column_names = TRUE, show_row_names = FALSE,
show_heatmap_legend = TRUE)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` arugment by explicitly setting
## TRUE/FALSE to it. Set `ht_opt$message = FALSE` to turn off this
## message.
# Takes a while to run
current_heatmap
Now we want to plot the MDS using limma R package. We can tell there are 3 distinct clusters. The cluster to the top left show the cIN neuron cells which all have similar logFC dim1 and dim2 scores. The cluster to the top right shows the affected proband, unaffected mother and intermediate sibling cExN cells, which are distinct from the unaffected control cExN cells. This is useful because we want to see analyze the gene expression in cells from patients with autism syndrome disorder or their relatives.
limma::plotMDS(heatmap_matrix, col = rep(c("darkgreen", "blue"), 10))
To see the data in a different perspective, we can colour by patient and do another MDS plot. We can see that relateively, clustering for patients is better in the cIN cells than the cExN cells. Once again we see that there is significant difference between the gene expression of UC cExN cells and the gene expression of the other 3 patient cells.
# Based off lecture notes
pat_colors <- rainbow(4)
pat_colors <- unlist(lapply(pat_colors,FUN=function(x){rep(x,4)}))
limma::plotMDS(heatmap_matrix, col = pat_colors)
For the next step, we want to create a model matrix and fit the data we have into it. We will focus on using the patient column for analysis since we want to understand differential expression of genes in patients with or related to someone with autism and patients without autism.
# Set model to patients
model_design <- model.matrix(~ samples$patients)
To calculate p-values we will use the limma method.
# Based off lecture notes
# There was an error citing duplicate row names. We will use make.names to make them unique.
rownames(normalized_counts) = make.names(rownames(normalized_counts), unique=TRUE)
# Calculate expression matrix and minimal set
expression_matrix <- as.matrix(normalized_counts)
minimal_set <- Biobase::ExpressionSet(assayData=expression_matrix)
# Fit the data
fit <- limma::lmFit(minimal_set, model_design)
fit2 <- limma::eBayes(fit,trend=TRUE)
topfit <- limma::topTable(fit2,
coef=ncol(model_design),
adjust.method = "BH",
number = nrow(expression_matrix))
output_hits <- merge(rownames(normalized_counts),
topfit,
by.y=0, by.x=1,
all.y=TRUE)
output_hits <- output_hits[order(output_hits$P.Value),]
# Take a look at the top 10 hits
knitr::kable(output_hits[1:10,], type="html")
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 17985 | ZNF257 | 7.113438 | 1.8143439 | 32.22859 | 0 | 0 | 19.84025 |
| 17948 | ZNF208 | 6.565174 | 1.7038129 | 31.64309 | 0 | 0 | 19.77352 |
| 18384 | ZNF835 | 10.623805 | 4.0443022 | 25.35091 | 0 | 0 | 18.79089 |
| 7320 | GAPDHP35 | 4.685038 | 1.3709558 | 22.47889 | 0 | 0 | 18.10365 |
| 4555 | CDYL2 | -30.850848 | 27.4617530 | -22.33288 | 0 | 0 | 18.06297 |
| 133 | AC003973.3 | 3.198164 | 0.8114953 | 18.97549 | 0 | 0 | 16.92398 |
| 2426 | AL365357.1 | 22.052102 | 12.6921468 | 18.51508 | 0 | 0 | 16.73133 |
| 5454 | CYP4F29P | -2.319478 | 0.6142981 | -18.09176 | 0 | 0 | 16.54480 |
| 6235 | EIF1AY | -39.206172 | 10.4449541 | -17.81348 | 0 | 0 | 16.41701 |
| 1826 | ADORA2B | 4.099782 | 3.5184098 | 17.26614 | 0 | 0 | 16.15294 |
output_hits
From the results we can see that 2499 genes had a P value < 0.05.
length(which(output_hits$P.Value < 0.05))
## [1] 2499
Once corrected, that number goes down to 714 genes.
simple_adj_p_values <- length(which(output_hits$adj.P.Val < 0.05))
simple_adj_p_values
## [1] 714
We graph the p-values to see if we have any potential problems. The graph shows that we have a set of good p-values since the bins are highly right skewed and not uniformly distributed.
ggplot2::ggplot(output_hits, ggplot2::aes(x=P.Value)) +
ggplot2::geom_histogram() +
ggplot2::labs(title="P values", y="counts", x="p-value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We now want to improve our results and account for the variability seen in different patients and cell types. We’ll create a design matrix that accounts for both cell type and patient.
model_design_ct <- model.matrix(~ samples$cell_type + samples$patients)
Now we repeat the previous step of fitting the data to our new model.
# Based off lecture notes
fit_ct <- limma::lmFit(minimal_set, model_design_ct)
fit_ct2 <- limma::eBayes(fit_ct,trend=TRUE)
topfit_ct <- limma::topTable(fit_ct2,
coef=ncol(model_design_ct),
adjust.method = "BH",
number = nrow(expression_matrix))
output_hits_ct <- merge(rownames(normalized_counts),
topfit_ct,
by.y=0, by.x=1,
all.y=TRUE)
output_hits_ct <- output_hits_ct[order(output_hits_ct$P.Value),]
# Take a look at the top 10 hits
knitr::kable(output_hits_ct[1:10,], type="html")
| x | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|
| 17985 | ZNF257 | 7.113438 | 1.8143439 | 32.36326 | 0 | 0 | 28.05109 |
| 17948 | ZNF208 | 6.565174 | 1.7038129 | 31.91476 | 0 | 0 | 27.95467 |
| 18384 | ZNF835 | 10.623805 | 4.0443022 | 30.20039 | 0 | 0 | 27.55387 |
| 7320 | GAPDHP35 | 4.685038 | 1.3709558 | 24.23032 | 0 | 0 | 25.62799 |
| 4555 | CDYL2 | -30.850848 | 27.4617530 | -23.89015 | 0 | 0 | 25.48564 |
| 2426 | AL365357.1 | 22.052102 | 12.6921468 | 19.78701 | 0 | 0 | 23.36693 |
| 1895 | AGBL4 | 5.116940 | 2.3194307 | 19.31923 | 0 | 0 | 23.06879 |
| 133 | AC003973.3 | 3.198164 | 0.8114953 | 19.19195 | 0 | 0 | 22.98529 |
| 13022 | PTPRN2 | -56.016087 | 40.4856521 | -19.12102 | 0 | 0 | 22.93830 |
| 1284 | AC104692.2 | 1.721314 | 0.6920458 | 18.71185 | 0 | 0 | 22.66078 |
Again take a look at the p-values. From the results we can see that 5682 genes had a P value < 0.05.
length(which(output_hits_ct$P.Value < 0.05))
## [1] 5682
Once corrected, that number goes down to 3171 genes.
complex_adj_p_values <- length(which(output_hits_ct$adj.P.Val < 0.05))
complex_adj_p_values
## [1] 3171
We graph the p-values to see if we have any potential problems. Again, the graph shows that we have a set of good p-values since the bins are highly right skewed and not uniformly distributed.
ggplot2::ggplot(output_hits_ct, ggplot2::aes(x=P.Value)) +
ggplot2::geom_histogram() +
ggplot2::labs(title="P values", y="counts", x="p-value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now I want to compare the two models I have. Simple model refers to the model that considers patient only while complex model refers to the model with patient and cell type.
# Based off lecture notes
simple_model_pvalues <- data.frame(gene_id = output_hits$x, simple_pvalue = output_hits$P.Value)
complex_model_pvalues <- data.frame(gene_id = output_hits_ct$x, complex_pvalue = output_hits_ct$P.Value)
two_models_pvalues <- merge(simple_model_pvalues, complex_model_pvalues, by.x=1, by.y=1)
two_models_pvalues$colour <- "black"
two_models_pvalues$colour[two_models_pvalues$simple_pvalue<0.05] <- "orange"
two_models_pvalues$colour[two_models_pvalues$complex_pvalue<0.05] <- "blue"
two_models_pvalues$colour[two_models_pvalues$simple_pvalue<0.05 & two_models_pvalues$complex_pvalue<0.05] <- "red"
plot(two_models_pvalues$simple_pvalue, two_models_pvalues$complex_pvalue, col = two_models_pvalues$colour, xlab = "simple model p-values", ylab ="Complex model p-values", main="Simple vs Complex Limma")
legend("topleft", inset=0.1, legend=c("Simple", "Complex", "Both", "Not Significant"),
fill = c("orange", "blue", "red", "black"))
In the above plot, we can see that the simple model which considers patients only has a better fit to the p-values of both models.
We want to include the FMR1 gene in the plot which is an important gene for autism syndrome disorder.
# Based off lecture notes
two_models_pvalues$colour <- "grey"
two_models_pvalues$colour[two_models_pvalues$gene_id == "FMR1"] <- "red"
plot(two_models_pvalues$simple_pvalue, two_models_pvalues$complex_pvalue, col = two_models_pvalues$colour, xlab = "Simple model p-values", ylab ="Complex model p-values", main="Simple vs Complex Limma")
points(two_models_pvalues[which(two_models_pvalues$gene_id == "CHCHD2"), 2:3], pch=2, col="red", cex=0.8 )
legend("topleft", inset=0.1, legend=c("FMR1", "rest"),
fill = c("red", "grey"), cex=0.7)
Now, we will graph a heatmap for the top hits in the model.
# Based off lectures
top_hits <- output_hits$x[output_hits$P.Value < 0.01]
heatmap_matrix_top_hits <- t(scale(t(heatmap_matrix[which(rownames(heatmap_matrix) %in% top_hits),])))
# Do a heatmap for cIN cells
heatmap_matrix_top_hits_cIN <- heatmap_matrix[, grep(colnames(heatmap_matrix_top_hits), pattern = "cIN")]
if(min(heatmap_matrix_top_hits_cIN) == 0) {
heatmap_col = circlize::colorRamp2(c(0, max(heatmap_matrix_top_hits_cIN)),
c("white", "red"))
} else {
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_top_hits_cIN), 0, max(heatmap_matrix_top_hits_cIN)), c("blue", "white", "red"))
}
current_heatmap_cIN <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_top_hits_cIN),
cluster_rows = TRUE, show_row_dend = FALSE,
col=heatmap_col, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` arugment by explicitly setting
## TRUE/FALSE to it. Set `ht_opt$message = FALSE` to turn off this
## message.
current_heatmap_cIN
From the cIN cells it is difficult to distinguish clusters that are significantly different.
# Do a heatmap for cExN cells
heatmap_matrix_top_hits_cExN <- heatmap_matrix[, grep(colnames(heatmap_matrix_top_hits), pattern = "cExN")]
if(min(heatmap_matrix_top_hits_cExN) == 0) {
heatmap_col = circlize::colorRamp2(c(0, max(heatmap_matrix_top_hits_cExN)),
c("white", "red"))
} else {
heatmap_col = circlize::colorRamp2(c(min(heatmap_matrix_top_hits_cExN), 0, max(heatmap_matrix_top_hits_cExN)), c("blue", "white", "red"))
}
current_heatmap_cExN <- ComplexHeatmap::Heatmap(as.matrix(heatmap_matrix_top_hits_cExN),
cluster_rows = TRUE, show_row_dend = FALSE,
col=heatmap_col, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE)
## `use_raster` is automatically set to TRUE for a matrix with more than
## 2000 rows. You can control `use_raster` arugment by explicitly setting
## TRUE/FALSE to it. Set `ht_opt$message = FALSE` to turn off this
## message.
current_heatmap_cExN
From the cExN cells it is easier to distinguish clusters that are significantly different. The UC cells have brighter blues and brighter reds near the middle. As well, for the 3 members of the affected proband’s family, the top part is more red, while the for the control, it is more blue. Comparing the unaffected mother to the intermediate sibling and affected proband, we see clusters of of darker blue near the top in the mother, but not in the affected proband and intermediate sibling.
Using a volcano plot we can see
volcano_plot <- cbind(output_hits$logFC, -log10(output_hits$P.Val))
colnames(volcano_plot) <- c("logFC", "-log10(adjusted p-value)")
rownames(volcano_plot) <- output_hits$x
plot_colours <- ifelse(output_hits$logFC < 0 & output_hits$P.Val<0.05, "red", ifelse(output_hits$logFC > 0 & output_hits$P.Val < 0.05, "blue", "black"))
plot(volcano_plot, pch = 8, xlim = c(-100, 100), col = plot_colours, cex = 0.05, main = "log2FC vs -log10(adjusted p-value) Graph")
# Genes from paper that are involved in behavior and developmental disorders
points(volcano_plot[c("CHCHD2", "CHL1", "GAS7", "S100A6"),], pch=6,
col=c("orange", "green", "darkblue", "purple"), cex=1 )
legend("topleft", inset=0.1, legend=c("Significantly Underexpressed", "Significantly Overexpressed", "Not Significant"),
fill = c("red", "blue", "black"), cex=0.7)
The volcano graph above shows the genes that are upregulated in blue and downregulated in red. It tells us the distribution of the upregulated vs downregulated. We can see 4 genes mentioned in the paper related to behavior and developmental disorders. 2 of them, GAS7 and S100A6 are overly expressed and 2 of them, CHCHD2 and CHL1 are under expressed (Lewis et al. 2019).
There were 714 genes that were differentially expressed for the simple model considering patient type. There were 3171 genes that were differentially expressed for complex model considering both patient type and cell type. I used the p-value of 0.05 since that is the most common p-value widely accepted.
I chose to use the Benjamini-Hochberg method because it is straightforward and intuitive to use. As well since there are over 20,000 genes to consider, the false discovery rate can cause many false positives. The Benjamini-Hochberg test helps with reducing false positive rate, it in turns help reeduces the number of false positives in my model.
See volcano plot above.
See heatmaps above.
We see there are 1259 genes up regulated.
length(which(output_hits$P.Value < 0.05 & output_hits$logFC > 0))
## [1] 1259
We see there are 1240 genes down regulated.
length(which(output_hits$P.Value < 0.05 & output_hits$logFC < 0))
## [1] 1240
We want to write these genes to a file so we can do over-representation analysis.
output_hits_to_write <- output_hits
# underexpressed genes
underexpressed_genes <- output_hits$x[which(output_hits$P.Value < 0.05 & output_hits$logFC > 0)]
# overexpressed genes
overexpressed_genes <- output_hits$x[which(output_hits$P.Value < 0.05 & output_hits$logFC < 0)]
# all genes
all_genes <- output_hits$x[which(output_hits$P.Value < 0.05 & (output_hits$logFC > 0 | output_hits$logFC < 0))]
# write to files
write.table(x=underexpressed_genes, file="upregulated_genes.txt", sep= "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x=overexpressed_genes, file="downregulated_genes.txt", sep= "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(x=all_genes, file="combined_genes.txt", sep= "\t", row.names = FALSE, col.names = FALSE, quote = FALSE)
Now, we’ll use the G-profiler tool to do our over-representation analysis.
I chose this tool since it’s been introduced to me as a homework asssignment and I’m familiar with it. Furthermore, it’s constantly updated and has connections to several different databases.
I chose to use the GO Biological Processes, Wiki pathways and Reactome as introduced in the homework assignment. These are all databases that have been recently updated (February 2021 for Go, March 2021 for Wiki pathways, December 2020 for Reactome) based on information provided at their websites.
For upregulated genes, 0 genesets from GO:BP, 15 genesets from Reactome, 2 genesets from WP were returned. For downregulated genes, 163 genesets from GO:BP, 34 genesets from Reactome, 9 gensets from WP were returned. For combined genes, 71 genesets from GO:BP, 23 genesets from Reactome, 5 genesets from WP were returned. I used threshold 0.05 with Benjamini-Hochberg for all 3 queries as this is the commonly accepted p-value.
Looking at the results I see that there were very few results returned for upregulated genes, more values returned for combined upregulated and downregulated genes and the most values returned for downregulated genes. This suggests that autism syndrome disorder (ASD) is correlated with the downregulation of many genes.
Figure_1
The genesets for downregulated genes suggests correlation in mitotic processes and cell cycle. This makes sense as ASD involves disruption of cell proliferation and differentiation
Figure_2
The genesets for upregulated genes are few, but 7q11.23 copy number variation syndrome is known to be associated with ASD (Sanders et al. 2011)
Figure_3
The combination of both downregulated and upregulated genes appear to correlate to metabolic process, which I don’t think is strongly correlated with ASD
Yes, the original paper found altered expression of genes related to cell proliferation and growth. In Figure_1, we also found genesets correlated to mitotic processes and cell cycles, which is related to cell profileration and growth. This suggests that the downregulated genes is associated with ASD.
Yes, in Figure_2 there is the geneset of 7q11.23 copy number variation, which is found to be associated with ASD in another paper (Sanders et al. 2011). Also, in another paper found brain maldevelopment which increases risk of ASD has a mechanism involving altered reuglation of the mitotic cycle (Pramparo et al. 2015).